(54) Multimode quantizing of the prediction residual in a speech coder



(57) A linear predictive system classifies LP residual Fourier
coefficients into two or more overlapping classes, and each class has
its own vector quantization codebook(s). In addition, the use of strong
and weak predictors is modified: a strong predictor following a weak
predictor is replaced with a weak predictor to ensure attenuation of
the error propagation that arises from frame erasures.



[Figures 1a and 1b: flow diagrams. Recoverable labels: Fourier
coefficients; strong predictor; weak predictor; compare; select
weak/strong quantized predictor.]






Description 



EP 1 035 538 A2

FIELD OF THE INVENTION

[0001] The present invention relates generally to 
the field of electronic devices, and, more particularly, to 
speech coding, technical transmission, storage, and 
synthesis circuitry and methods. 
[0002] The performance of digital speech systems using low bit rates
has become increasingly important with current and foreseeable digital
communications. One digital speech method, linear predictive coding
(LPC), uses a parametric model to mimic human speech. In this approach
only the parameters of the speech model are transmitted across the
communication channel (or stored), and a synthesizer regenerates the
speech with the same perceptual characteristics as the input speech
waveform. Periodic updating of the model parameters requires fewer bits
than direct representation of the speech signal, so a reasonable LPC
vocoder can operate at bit rates as low as 2-3 Kbps (kilobits per
second), whereas the public telephone system uses 64 Kbps (8-bit PCM
codewords at 8,000 samples per second). See, for example, McCree et al.,
A 2.4 Kbit/s MELP Coder Candidate for the New U.S. Federal Standard,
Proc. IEEE Int. Conf. ASSP 200 (1996) and US Patent No. 5,699,477.

[0003] However, the speech output from such LPC vocoders is not
acceptable in many applications because it does not always sound like
natural human speech, especially in the presence of background noise.
Moreover, there is a demand for a speech vocoder with at least
telephone-quality speech at a bit rate of about 4 Kbps. Various
approaches to improve quality include enhancing the estimation of the
parameters of a mixed excitation linear prediction (MELP) system and
quantizing them more efficiently. See Yeldener et al., A Mixed
Sinusoidally Excited Linear Prediction Coder at 4 kb/s and Below, Proc.
IEEE Int. Conf. Acoust., Speech, Signal Processing (1998) and Shlomot
et al., Combined Harmonic and Waveform Coding of Speech at Low Bit
Rates, IEEE ... 585 (1998).

SUMMARY OF THE INVENTION 

[0004] The present application discloses a linear predictive coding
method with the residual's Fourier coefficients classified into
overlapping classes, with each class having its own vector quantization
codebook(s).

[0005] Additionally, both strongly predictive and weakly predictive
codebooks may be used, but with a weak predictor replacing a strong
predictor which otherwise would have followed a weak predictor.

[0006] Advantages include maintenance of low bit rates with increased
performance, and avoidance of error propagation by a series of strong
predictors.



BRIEF DESCRIPTION OF THE DRAWINGS

[0007] Specific embodiments of the present invention will now be
described in further detail, by way of example, with reference to the
accompanying drawings in which:

Figures 1a-1b are flow diagrams of preferred embodiments;
Figures 2a-2b illustrate preferred embodiment coder and decoder in
block format; and
Figures 3a-3d show an LP residual and its Fourier transforms.

DESCRIPTION OF THE PREFERRED EMBODIMENTS

[0008] First preferred embodiments classify the spectra of the linear
prediction (LP) residual (in a MELP coder) into classes of spectra
(vectors) and vector quantize each class separately. For example, one
first preferred embodiment classifies the spectra into long vectors
(many harmonics, which correspond roughly to low pitch frequency as
typical of male speech) and short vectors (few harmonics, which
correspond roughly to high pitch frequency as typical of female
speech). These spectra are then vector quantized with separate
codebooks to facilitate encoding of vectors with different numbers of
components (harmonics). Figure 1a shows the classification flow and
includes an overlap of the classes.

[0009] Second preferred embodiments allow for predictive coding of the
spectra (or, alternatively, other parameters such as line spectral
frequencies or LSFs) and a selection of either the strong or weak
predictor based on best approximation, but with the proviso that a
first strong predictor which otherwise follows a weak predictor is
replaced with a weak predictor. This deters a sequence of strong
predictors from propagating an error that occurred in the weak
predictor preceding the series. Figure 1b illustrates a predictive
coding control flow.

[0010] Figures 2a-2b illustrate preferred embodiment MELP coding
(analysis) and decoding (synthesis) in block format. In particular, the
Linear Prediction Analysis determines the LPC coefficients a(j),
j = 1, 2, ..., M, for an input frame of digital speech samples {y(n)}
by setting:

    e(n) = y(n) - \sum_{j=1}^{M} a(j) y(n-j)    (1)

and minimizing \sum_n e(n)^2. Typically M, the order of the linear
prediction filter, is taken to be about 10-12; the sampling rate to
form the samples y(n) is taken to be 8000 Hz (the same as the public
telephone network sampling for digital transmission); and the number of
samples {y(n)} in a frame is often 160 (a 20 msec frame) or 180 (a 22.5
msec frame). A frame of samples may be generated by various windowing
operations applied to the input speech samples. The name "linear
prediction" arises from the interpretation of
e(n) = y(n) - \sum_{j=1}^{M} a(j) y(n-j) as the error in predicting
y(n) by the linear sum of preceding samples \sum_{j=1}^{M} a(j) y(n-j).
Thus minimizing \sum_n e(n)^2 yields the {a(j)} which furnish the best
linear prediction. The coefficients {a(j)} may be converted to LSFs for
quantization and transmission.
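The minimization in equation (1) can be sketched in a few lines. The following is only an illustration under stated assumptions: it uses the autocorrelation method (one common way to minimize the squared error; the text does not say which method is used), treats samples outside the frame as zero, and the function names lp_coefficients and lp_residual are hypothetical.

```python
import numpy as np

def lp_coefficients(y, order=10):
    """Return a(1..M) minimizing sum e(n)^2 for e(n) = y(n) - sum_j a(j) y(n-j).

    Assumption: autocorrelation method, samples outside the frame are zero.
    """
    y = np.asarray(y, dtype=float)
    # Autocorrelation r(0..M) of the frame.
    r = np.array([np.dot(y[: len(y) - k], y[k:]) for k in range(order + 1)])
    # Normal equations R a = r(1..M), with R[i, j] = r(|i - j|).
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:])

def lp_residual(y, a):
    """Compute e(n) of equation (1), treating samples before the frame as zero."""
    y = np.asarray(y, dtype=float)
    e = y.copy()
    for j, aj in enumerate(a, start=1):
        e[j:] -= aj * y[:-j]
    return e
```

A production coder would more likely solve the same equations with the Levinson-Durbin recursion; the direct solve is used here only to keep the sketch short.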

[0011] The {e(n)} form the LP residual for the frame and ideally would
be the excitation for the synthesis filter 1/A(z) where A(z) is the
transfer function of equation (1). Of course, the LP residual is not
available at the decoder; so the task of the encoder is to represent
the LP residual so that the decoder can generate the LP excitation from
the encoded parameters.

[0012] The Band-Pass Voicing for a frequency band of samples (typically
two to five bands, such as 0-500 Hz, 500-1000 Hz, 1000-2000 Hz,
2000-3000 Hz, and 3000-4000 Hz) determines whether the LP excitation
derived from the LP residual {e(n)} should be periodic (voiced) or
white noise (unvoiced) for a particular band.

[0013] The Pitch Analysis determines the pitch period (smallest period
in voiced frames) by low pass filtering {y(n)} and then correlating
{y(n)} with {y(n+m)} for various m; interpolations provide for
fractional sample intervals. The resultant pitch period is denoted pT,
where p is a real number, typically constrained to be in the range 20
to 132, and T is the sampling interval of 1/8 millisecond. Thus p is
the number of samples in a pitch period. The LP residual {e(n)} in
voiced bands should be a combination of pitch-frequency harmonics.
[0014] Fourier Coeff. Estimation provides coding of 
the LP residual for voiced bands. The following sections 
describe this in detail. 

[0015] Gain Analysis sets the overall energy level 
for a frame. 

[0016] The encoding (and decoding) may be implemented with a digital
signal processor (DSP) such as the TMS320C30 manufactured by Texas
Instruments, which can be programmed to perform the analysis or
synthesis essentially in real time.
[0017] Figure 3a illustrates an LP residual {e(n)} for a voiced frame
and includes about eight pitch periods with each pitch period about 26
samples. Figure 3b shows the magnitudes of the {E(j)} for one
particular period of the LP residual, and Figure 3c shows the
magnitudes of the {E(j)} for all eight pitch periods. For a voiced
frame with pitch period equal to pT, the Fourier coefficients peak
about 1/pT, 2/pT, 3/pT, ..., k/pT, ...; that is, at the fundamental
frequency 1/pT and harmonics. Of course, p may not be an integer, and
the magnitudes of the Fourier coefficients at the fundamental-frequency
harmonics, denoted X[1], X[2], ..., X[k], ..., must be estimated. These
estimates will be quantized, transmitted, and used by the decoder to
create the LP excitation.

[0018] The {X[k]} may be estimated by various methods: for example,
apply a discrete Fourier transform to the samples of a single period
(or small number of periods) of e(n) as in Figures 3b-3c;
alternatively, the {E(j)} can be interpolated. Indeed, one
interpolation approach applies a 512-point discrete Fourier transform
to an extended version of the LP residual, which allows use of a fast
Fourier transform. In particular, extend the LP residual {e(n)} of 160
samples to 512 samples by setting e_512(n) = e(n) for n = 0, 1, ...,
159, and e_512(n) = 0 for n = 160, 161, ..., 511. Then the discrete
Fourier transform magnitudes appear as in Figure 3d with coefficients
E_512(j) which essentially interpolate the coefficients E(j) of Figures
3b-3c. Estimate the peaks X[k] at frequencies k/pT. The preferred
embodiment only uses the magnitudes of the Fourier coefficients,
although the phases could also be used. Because the LP residual
components {e(n)} are real, the discrete Fourier transform coefficients
{E(j)} are conjugate symmetric: E(k) = E*(N-k) for an N-point discrete
Fourier transform. Thus only half of the {E(j)} need be used for
magnitude considerations.
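As an illustration of the zero-padding approach, the sketch below extends the residual to 512 samples, takes the transform magnitudes, and reads off one value per harmonic. The peak-picking rule (largest magnitude within half a harmonic spacing of bin k*512/p) is an assumption made for the example, since the text does not specify how the peaks are estimated; the name harmonic_magnitudes is hypothetical.

```python
import numpy as np

def harmonic_magnitudes(e, p, nfft=512):
    """Estimate |X[k]| at frequencies k/(pT) from a zero-padded DFT of the residual."""
    e = np.asarray(e, dtype=float)
    # rfft zero-pads e to nfft samples and returns only the non-redundant half,
    # which is all that is needed for a real input (conjugate symmetry).
    spectrum = np.abs(np.fft.rfft(e, n=nfft))
    num_harmonics = int(p // 2)      # harmonics up to half the sampling rate
    half_spacing = nfft / (2.0 * p)  # half the distance between harmonics, in bins
    mags = []
    for k in range(1, num_harmonics + 1):
        center = k * nfft / p        # bin of the k-th harmonic (may be fractional)
        lo = max(int(np.floor(center - half_spacing)), 0)
        hi = min(int(np.ceil(center + half_spacing)), len(spectrum) - 1)
        mags.append(spectrum[lo:hi + 1].max())  # assumed peak-picking rule
    return np.array(mags)
```

For a residual that is a pure impulse train with a 26-sample period, every harmonic comes out with nearly the same magnitude, as expected for a flat excitation spectrum.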

[0019] Once the estimated magnitudes of the Fourier coefficients X[k]
for the fundamental pitch frequency and harmonics k/pT have been found,
they must be transmitted with a minimal number of bits. The preferred
embodiments use vector quantization of the spectra. That is, treat the
set of Fourier coefficients X[1], X[2], ..., X[k], ... as a vector in a
multi-dimensional quantization, and transmit only the index of the
output quantized vector. Note that there are [p] or [p]+1 coefficients,
but only half of the components are significant due to their conjugate
symmetry. Thus for a short pitch period such as pT = 4 milliseconds
(p = 32), the fundamental frequency 1/pT (= 250 Hz) is high and there
are 32 harmonics, but only 16 would be significant (not counting the DC
component). Similarly, for a long pitch period such as pT = 12
milliseconds (p = 96), the fundamental frequency (= 83 Hz) is low and
there are 48 significant harmonics.
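The arithmetic of the two examples above can be checked with a short helper. This is only a sketch: the function name is hypothetical, and the 8000 Hz default is the sampling rate stated earlier in the description.

```python
def harmonic_summary(p, sample_rate_hz=8000.0):
    """Return (fundamental frequency in Hz, number of significant harmonics)."""
    fundamental_hz = sample_rate_hz / p  # 1/(pT) with T = 1/sample_rate_hz
    significant = int(p // 2)            # half of about p coefficients, by conjugate symmetry
    return fundamental_hz, significant
```

With p = 32 this gives 250 Hz and 16 significant harmonics; with p = 96 it gives about 83 Hz and 48.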
[0020] In general, the set of output quantized vectors may be created
by adaptive selection with a clustering method from a set of input
training vectors. For example, a large number of randomly selected
vectors (spectra) from various speakers can be used to form a codebook
(or codebooks with multistep vector quantization). Thus a quantized and
coded version of an input spectrum X[1], X[2], ..., X[k], ... can be
transmitted as the index in the codebook of the quantized vector, which
may be 20 bits.

[0021] As illustrated in Figure 1a, the first preferred embodiments
proceed with vector quantization of the Fourier coefficient spectra as
follows. First, classify a Fourier coefficient spectrum (vector)
according to the corresponding pitch period: if the pitch period is
less than 55T, the vector is a "short" vector, and if the pitch period
is more than 45T, the vector is a "long" vector. Some vectors will
qualify as both short and long vectors. Vector quantize the short
vectors with a codebook of 20-component vectors, and vector quantize
the long vectors with a codebook of 45-component vectors. As described
previously, conjugate symmetry of the Fourier coefficients implies only
the first half of the vector components are significant and used. For
short vectors with fewer than 20 significant components, expand to 20
components by appending components equal to 1. Analogously, for long
vectors with fewer than 45 significant components, expand to 45
components by appending components equal to 1. Each codebook has 2^20
output quantized vectors, so 20 bits will index the output quantized
vectors in each codebook. One bit could be used to select the codebook,
but the pitch is transmitted and can be used to determine whether the
20 bits are long or short vector quantization.
[0022] For a vector classified as both short and 
long, use the same classification as the preceding 
frame's vector; this avoids discontinuities and provides 
a hysteresis by the classification overlap. Further, if the 
preceding frame was unvoiced, then take the vector as 
short if the pitch period is less than 50T and long other- 
wise. 
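The classification with overlap and hysteresis, together with the padding rule, can be sketched as follows. The thresholds 45, 50 and 55 (in samples) and the dimensions 20 and 45 come from the text; the function names are hypothetical, and truncating a vector that is longer than the codebook dimension is an assumption made only to keep the sketch simple.

```python
SHORT_MAX = 55   # a pitch period below 55T may be classified "short"
LONG_MIN = 45    # a pitch period above 45T may be classified "long"
TIE_BREAK = 50   # threshold used when the preceding frame was unvoiced
SIZES = {"short": 20, "long": 45}  # codebook vector dimensions

def classify(p, previous_class=None):
    """Return "short" or "long" for pitch period p (in samples).

    previous_class is the preceding frame's class, or None if that frame
    was unvoiced.
    """
    is_short = p < SHORT_MAX
    is_long = p > LONG_MIN
    if is_short and is_long:
        # Overlap region: keep the preceding frame's class (hysteresis).
        if previous_class in SIZES:
            return previous_class
        return "short" if p < TIE_BREAK else "long"
    return "short" if is_short else "long"

def pad_vector(x, cls):
    """Expand x to the codebook dimension by appending components equal to 1."""
    size = SIZES[cls]
    x = list(x)[:size]  # assumption: components beyond the dimension are dropped
    return x + [1.0] * (size - len(x))
```

For example, a frame with p = 50 stays "long" if the preceding frame was long and stays "short" if it was short, which is the hysteresis the overlap is meant to provide.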

[0023] Apply a weighting factor to the metric defining distance between
vectors. The distance is used both for the clustering of training
vectors (which creates the codebook) and for the quantization of
Fourier component vectors by minimum distance. In general, define a
distance between vectors X_1 and X_2 by
d(X_1, X_2) = (X_1 - X_2)^T W (X_1 - X_2) with W a matrix of weights.
Thus define matrices W_short for short vectors and matrices W_long for
long vectors; further, the weights may depend upon the length of the
vector to be quantized. Then for short vectors take W_short[j,k] very
small for either j or k larger than 20; this will render the components
X_1[k] and X_2[k] irrelevant for k larger than 20. Further, take
W_short[j,k] decreasing as j and k increase from 1 to 20 to emphasize
the lower vector components. That is, the quantization will depend
primarily upon the Fourier coefficients for the fundamental and low
harmonics of the pitch frequency. Analogously, take W_long[j,k] very
small for j or k larger than 45.
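A minimal sketch of the weighted distance and a minimum-distance codebook search follows. The diagonal form of W, the linear decay from 1.0 to 0.5, and the floor value are all assumptions chosen for illustration; the text only requires weights that decrease with index and become very small beyond the codebook dimension. All function names are hypothetical.

```python
import numpy as np

def weighted_distance(x1, x2, W):
    """d(X_1, X_2) = (X_1 - X_2)^T W (X_1 - X_2)."""
    d = np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float)
    return float(d @ W @ d)

def make_weights(dim, size, floor=1e-6):
    """Diagonal W: decreasing over the first `size` components, tiny beyond them.

    The decay profile and floor are illustrative assumptions.
    """
    w = np.full(dim, floor)
    w[:size] = np.linspace(1.0, 0.5, size)
    return np.diag(w)

def nearest_index(x, codebook, W):
    """Index of the codebook vector at minimum weighted distance from x."""
    diffs = np.asarray(codebook, dtype=float) - np.asarray(x, dtype=float)
    # For each row i: sum over j, k of diffs[i, j] * W[j, k] * diffs[i, k].
    dists = np.einsum("ij,jk,ik->i", diffs, W, diffs)
    return int(np.argmin(dists))
```

With small weights on the trailing components, large mismatches there barely affect which codebook entry is selected.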
[0024] Further, the use of predictive coding could 
be included to reduce the magnitudes and decrease the 
quantization noise as described in the following. 

Predictive coding 

[0025] A differential (predictive) approach will decrease the
quantization noise. That is, rather than vector quantize a spectrum
X[1], X[2], ..., X[k], ..., first generate a prediction of the spectrum
from the preceding one or more frames' quantized spectra (vectors) and
just quantize the difference. If the current frame's vector can be well
approximated from the prior frames' vectors, then a "strong" prediction
can be used in which the difference between the current frame's vector
and a strong predictor may be small. Contrarily, if the current frame's
vector cannot be well approximated from the prior frames' vectors, then
a "weak" prediction (including no prediction) can be used in which the
difference between the current frame's vector and a predictor may be
large. For example, a simple prediction of the current frame's vector X
could be the preceding frame's quantized vector Y, or more generally a
multiple αY with α a weight factor (between 0 and 1). Indeed, α could
be a diagonal matrix with different factors for different vector
components. For α values in the range 0.7-1.0, the predictor αY is
close to Y and, if also close to X, the difference vector X-αY to be
quantized is small compared to X. This would be a strong predictor, and
the decoder recovers an estimate for X by Q(X-αY) + αY, with the first
term the quantized difference vector X-αY and the second term from the
previous frame and likely the dominant term. Conversely, for α values
in the range 0.0-0.3, the predictor is weak in that the difference
vector X-αY to be quantized is likely comparable to X. In fact, α = 0
is no prediction at all and the vector to be quantized is X itself.
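The reconstruction Q(X-αY) + αY can be illustrated with a toy quantizer. This is a sketch only: the quantizer is a plain Euclidean nearest-neighbor search over a tiny made-up codebook, and the function names are hypothetical.

```python
import numpy as np

def quantize(v, codebook):
    """Q(v): nearest codebook entry (plain Euclidean distance, for illustration)."""
    cb = np.asarray(codebook, dtype=float)
    return cb[np.argmin(np.sum((cb - v) ** 2, axis=1))]

def predictive_reconstruct(x, y_prev, alpha, codebook):
    """Decoder-side estimate of X: Q(X - alpha*Y) + alpha*Y."""
    x = np.asarray(x, dtype=float)
    pred = alpha * np.asarray(y_prev, dtype=float)  # predictor from the previous frame
    return quantize(x - pred, codebook) + pred
```

Setting alpha to 0 reduces this to quantizing X directly, matching the remark that α = 0 is no prediction at all.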

[0026] The advantage of strong predictors follows from the fact that
with the same size codebooks, quantizing something likely to be small
(strong-predictor difference) will give better average results than
quantizing something likely to be large (weak-predictor difference).
[0027] Thus train four codebooks: (1) short vectors and strong
prediction, (2) short vectors and weak prediction, (3) long vectors and
strong prediction, and (4) long vectors and weak prediction. Then
process a vector as illustrated in the top portion of Figure 1b: first
the vector X is classified as short or long; next, the strong and weak
predictor vectors, X_strong and X_weak, are generated from previous
frames' quantized vectors, and the strong predictor and weak predictor
codebooks are used for vector quantization of X - X_strong and
X - X_weak, respectively. Then the two results
(Q(X - X_strong) + X_strong and Q(X - X_weak) + X_weak) are compared to
the input vector and the better approximation (strong or weak
predictor) is selected. A bit is transmitted (to indicate whether a
strong or weak predictor was used) along with the 20-bit codebook index
for the quantization vector. The pitch determines whether the vector
was long or short.
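The strong/weak selection can be sketched as below. Several details are assumptions for the example: the predictors are taken as scalar multiples of the previous quantized vector (weights 0.8 and 0.2), the comparison uses unweighted squared error, and the codebooks are tiny placeholders. The function names are hypothetical.

```python
import numpy as np

def nearest(v, codebook):
    """Return (index, entry) of the codebook vector closest to v."""
    cb = np.asarray(codebook, dtype=float)
    i = int(np.argmin(np.sum((cb - v) ** 2, axis=1)))
    return i, cb[i]

def encode_frame(x, y_prev, strong_cb, weak_cb, alpha_strong=0.8, alpha_weak=0.2):
    """Quantize with both predictors and keep the better reconstruction.

    Returns (predictor name, codebook index, reconstructed vector).
    """
    x = np.asarray(x, dtype=float)
    y_prev = np.asarray(y_prev, dtype=float)
    candidates = {}
    for name, alpha, cb in (("strong", alpha_strong, strong_cb),
                            ("weak", alpha_weak, weak_cb)):
        pred = alpha * y_prev                 # X_strong or X_weak
        idx, q = nearest(x - pred, cb)        # Q(X - predictor)
        recon = q + pred                      # Q(X - predictor) + predictor
        candidates[name] = (float(np.sum((x - recon) ** 2)), idx, recon)
    choice = min(candidates, key=lambda k: candidates[k][0])
    _, idx, recon = candidates[choice]
    return choice, idx, recon
```

The returned predictor name corresponds to the one transmitted bit, and the index to the codebook index sent alongside it.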

[0028] In a frame erasure the parameters (i.e., LSFs, Fourier
coefficients, pitch, ...) corresponding to the current frame are
considered lost or unreliable and the frame is reconstructed based on
the parameters from the previous frames. In the presence of frame
erasures the error resulting from missing a set of parameters will
propagate throughout the series of frames for which a strong prediction
is used. If the error occurs in the middle of the series, the exact
evolution of the predicted parameters is compromised and some
perceptual distortion is usually introduced. When a frame erasure
happens within a region where a weak predictor is consistently
selected, the effect of the error will be localized (it will be quickly
reduced by the weak prediction). The largest degradation in the
reconstructed frame is observed whenever a frame erasure occurs for a
frame with a weak predictor followed by a series of frames for which a
strong predictor is chosen. In this case the evolution of the
parameters is built up on a parameter very different from that which is
supposed to start the evolution.

[0029] Thus a second preferred embodiment analyzes the predictors used
in a series of frames and controls their sequencing. In particular, for
a current frame which otherwise would use a strong predictor
immediately following a frame which used a weak predictor, one
preferred embodiment modifies the current frame to use the weak
predictor but does not affect the next frame's predictor. Figure 1b
illustrates the decisions.
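The sequencing rule can be sketched as a small pass over the per-frame predictor choices. The function name and the `hold` parameter are hypothetical; hold = 1 corresponds to the rule described in this paragraph, and larger values are a straightforward generalization to forcing more than one following frame.

```python
def sequence_predictors(preferred, hold=1):
    """Apply the weak-after-weak rule to a list of "strong"/"weak" choices.

    preferred: predictor each frame would pick by best approximation.
    hold: number of frames after a genuinely weak frame that are forced weak.
    """
    used = []
    remaining = 0  # frames still to be forced to the weak predictor
    for choice in preferred:
        if choice == "weak":
            used.append("weak")
            remaining = hold       # a genuinely weak frame (re)starts the hold
        elif remaining > 0:
            used.append("weak")    # strong predictor replaced by weak
            remaining -= 1         # a forced frame does not restart the hold
        else:
            used.append("strong")
    return used
```

With the default hold, the sequence weak, strong, strong, strong becomes weak, weak, strong, strong: only the first strong frame after the weak one is changed.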
[0030] A simple example will illustrate the effect of this preferred
embodiment. Presume a sequence of frames with Fourier coefficient
vectors X_1, X_2, X_3, ... and presume the first frame uses a weak
predictor and the second, third, fourth, ... frames use strong
predictors, but the preferred embodiment replaces the second frame's
strong predictor with a weak predictor. Thus the transmitted quantized
difference vector for the first frame is Q(X_1 - X_{1,weak}) and
without erasure the decoder recovers X_1 as
Q(X_1 - X_{1,weak}) + X_{1,weak} with the first term likely the
dominant term due to weak prediction. Similarly, the usual decoder
recovers X_2 as Q(X_2 - X_{2,strong}) + X_{2,strong} with the second
term dominant, and analogously for X_3, X_4, .... In contrast, the
preferred embodiment decoder recovers X_2 as
Q(X_2 - X_{2,weak}) + X_{2,weak} but with the first term likely
dominant.

[0031] Note that the decoder recreates X_{1,weak} from the preceding
reconstructed frames' vectors X_0, X_{-1}, ..., and similarly for
X_{2,strong} and X_{2,weak} recreated from reconstructed X_1, X_0, ...,
and likewise for the other predictors.

[0032] Now with an erasure of the first frame parameters the vector
Q(X_1 - X_{1,weak}) is lost and the decoder reconstructs X_1 by
something such as just repeating the reconstructed X_0 from the prior
frame. However, this may not be a very good approximation because
originally a weak predictor was used. Then for the second frame, the
usual decoder reconstructs X_2 by Q(X_2 - X_{2,strong}) + Y_{2,strong}
with Y_{2,strong} the strong predictor recreated from X_0, X_0, ...
rather than from X_1, X_0, ... because X_1 was lost and replaced by the
possibly poor approximation X_0. Thus the error would roughly be
X_{2,strong} - Y_{2,strong}, which likely is large due to the strong
predictor being the dominant term compared to the difference term
Q(X_2 - X_{2,strong}). And this also applies to the reconstruction of
X_3, X_4, ....

[0033] Contrarily, the preferred embodiment reconstructs X_2 by
Q(X_2 - X_{2,weak}) + Y_{2,weak} with Y_{2,weak} the weak predictor
recreated from X_0, X_0, ... rather than from X_1, X_0, ..., again
because X_1 was lost and replaced by the possibly poor approximation
X_0. Thus the error would roughly be X_{2,weak} - Y_{2,weak}, which
likely is small due to the weak predictor being the smaller term
compared to the difference term Q(X_2 - X_{2,weak}). And this smaller
error also applies to the reconstruction of X_3, X_4, ....



[0034] Indeed, for the case of the predictors X_{2,strong} = αX_1 with
α = 0.8 and X_{2,weak} = αX_1 with α = 0.2, the usual decoder error
would be 0.8(X_1 - X_0) for reconstruction of X_2 and the preferred
embodiment decoder error would be 0.2(X_1 - X_0).
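The two error terms can be checked numerically. The sketch below assumes the predictor is a scalar multiple α of the previous reconstructed vector and that the transmitted difference term is received intact, so that the only error comes from rebuilding the predictor from X_0 instead of the lost X_1. The function name and the sample vectors are hypothetical.

```python
import numpy as np

def erasure_error(x0, x1, alpha):
    """Error in reconstructed X_2 when X_1 is erased and replaced by X_0."""
    x0 = np.asarray(x0, dtype=float)
    x1 = np.asarray(x1, dtype=float)
    intended = alpha * x1  # predictor the encoder used
    actual = alpha * x0    # predictor the decoder rebuilds after the erasure
    return intended - actual  # equals alpha * (X_1 - X_0)
```

With the same X_0 and X_1, the strong-predictor error (α = 0.8) is four times the weak-predictor error (α = 0.2).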

[0035] Alternative second preferred embodiments modify two (or more)
successive frames' strong predictors after a weak predictor frame to be
weak predictors. That is, a sequence of weak, strong, strong, strong,
... would be changed to weak, weak, weak, strong, ....

[0036] The foregoing replacement of strong predictors by weak
predictors provides a tradeoff of increased error robustness for
slightly decreased quality (the weak predictors being used in place of
better strong predictors).

Claims 

1. A method of Linear predictive system coding, comprising the steps
of:

classifying LP residual Fourier coefficients into two or more classes
of vectors;
for each class of vectors providing at least one vector quantization
codebook; and
encoding said vectors with said codebooks.

2. The coding of claim 1, wherein said classes of vectors overlap and a
vector in two or more classes is encoded using the class of a vector in
a preceding frame.

3. A method of Linear predictive system decoding, comprising the steps
of:

interpreting LP residual Fourier coefficients as members of two or more
overlapping classes of vectors with each class having at least one
vector quantization codebook; and
decoding an encoded vector using said codebooks.

4. An encoding method using strong and weak predic- 
tors, comprising the step of: 

replacing a strong predictor following a weak 
predictor with a weak predictor. 